Place an X in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scrubbed from any groups who include them.)
In this project, we investigate the relationship between health, education, and home security spending and crime rates, and conclude that there is little correlation between them. We built linear regression models using ordinary least squares and visualized them with scatter plots. From the regression lines and visualizations, crime rates show a weak negative correlation with health, education, and home security spending, respectively. We then applied linear regression, support vector regression with radial basis function and polynomial kernels, ridge regression, and Poisson regression, and concluded that there is no substantial relationship between our variables of interest.
In Becker's Economic Theory of Crime (1968), he stated that people resort to crime only if the costs of committing the crime are lower than the benefits gained. However, it turns out that crime is much more prevalent in poor, disadvantaged neighborhoods than in wealthy and middle-class neighborhoods, even though wealthier neighborhoods are more likely to contain valuable possessions. One potential reason for this unexpected phenomenon can be found in the article "Why Disadvantaged Neighborhoods are More Attractive Targets for Burgling than Wealthy Ones." The authors, Alyssa W. Chamberlain and Lyndsay N. Boggess, claim that wealthier communities have lower burglary rates because burglars tend to live far away from, and are unfamiliar with, wealthy neighborhoods. Instead, they are more likely to target disadvantaged neighborhoods where they themselves live, since familiarity lowers the risk of committing a crime. In reality, most people want to live in a "safer" neighborhood, and "safer" neighborhoods tend to have relatively higher rents or housing prices. This motivation is one reason people often associate the crime rate with the wealth level of a community. Since an individual's wealth level influences a great deal of decision-making, it is often measured through various factors, including salary, education expenditure, medical spending, debt, and home security system spending. Among these expenditures, spending on health, education, and home security systems in everyday life is among the most telling indicators of wealth level. Therefore, in this project, we aim to find the relationship between the wealth level of communities - more specifically, the health, education, and home security system spending of people living in those communities - and the crime rate.
Since wealth level is determined by a combination of sub-factors, we decided to narrow our scope from wealth level in general to expenditure on health, education, and home security systems (expenditure on these is often considered a fundamental contributor to measuring one's wealth). From the report "How does Health Spending in the U.S. Compare to Other Countries," we find that the United States spent about $11,946 per capita on health consumption, far beyond any other country; Japan, for example, spent only $4,691 per capita, less than half of the United States' figure. Given that the United States is also among the wealthiest countries in the world by GDP, we conclude that medical spending can reflect wealth level. By a similar argument, education plays a decisive role in economic performance: people with higher education levels often earn higher salaries than those with less education, and, more generally, richer countries tend to have more educated populations, which in turn drives economic growth at the national level. Furthermore, there are parallels between home security and financial standing, since wealthy communities tend to be safer and their residents are more likely to spend money on home security systems. As a result, we believe that wealth level can be reasonably reflected by spending on health, education, and home security systems.
It is impossible to incorporate all the data of each sub-factor contributing to one's wealth level or determine an authoritative and exhaustive metric that fully represents the variable. We eventually decide to take health, education, and home security system spending to determine one's wealth level. We are also aware that there might be confounding variables in our study of finding the relationship between these three spending factors and the crime rate. Thus, determining such a relationship is only our first step, possibly with separate study cases for each confounding variable.
A growing body of research has shown that most people with criminal records have serious health care needs, especially with a history of mental illness or psychological distress, as well as a lack of education. As a result, this prevalence of mental illness and lack of education in the criminal justice population has led the government to adopt the thinking that better access to health care and education helps reduce crime. Apart from that, it is not uncommon to acknowledge that crimes are much more prevalent among poor, disadvantaged neighborhoods than among wealthy and middle-class neighborhoods, where home security systems are better. While studies have proven that an increase in health, education, and home security system spending would cause a crime reduction, little research has been done focusing on the relationship between the three spending factors and the crime rate of communities. Taking it as our research interest, we believe that if we could find a relationship between them, we could utilize such findings to reduce the crime rate to the greatest extent by predicting the incidence of crime in each community. Therefore, governments or organizations can enact more restrictive laws and send more police force to those communities with higher estimates.
In addition to the passive reduction of crimes, we could take advantage of this finding and solve the issue from the root. By discovering communities with significantly low health, education, and home security system spending, the government could open more treatment facilities, schools, and security offices in such areas and make related expenditures more affordable. With better access to health, education, and home security systems serving as the first step, we could gradually improve the entire community's well-being, both economically and socially.
In addition, we notice that the distribution of wealth level is not normal (it is right-skewed), whereas the distributions of the three spending categories are approximately normal. Therefore, in this project, it is reasonable for us to use a normal distribution to approximate spending on health, education, and home security systems.
It is easy to find datasets on health, education, and home security system spending and crime rates. Nevertheless, La Jolla is too small to yield a large enough dataset (because most of the data are reported at the constituency level). Thus, we decided to treat San Diego County as our base and draw our data from this larger area.
Esri is one of the biggest data holders in the world. While it does not create or record data, it holds data for the federal government, state government, huge corporations, organizations, and individuals. ArcGIS Online is one of its tools for visualizing and manipulating its data. Taking advantage of its USA Census data and database, which contains detailed census data for every constituency, such as different crime indices, household income, and health information, we can establish different models and data frames to visualize. From that, comparisons between multiple categories can then be utilized to reveal the relationship between health, education, and home security system spending and crime rates.
The higher the average household health, education, and home security system spending a sector has, the lower that sector's crime rate.
This dataset provides information about general health care spending, educational spending, and home security system services in San Diego county in 2021. It also includes data describing the crime rate in the same year, including crimes such as murder, rape, robbery, assault, property crime, burglary, larceny, and motor vehicle theft. This dataset will be used to determine whether the community’s spending on the above three categories (general health care, education, and home security system services) is associated with (particularly with an antagonistic relationship) the crime rate in the community.
If you have not installed the plotly and scikit-learn packages, please run the cell below to install them.
# install packages
!pip install plotly
!pip install scikit-learn
To better perform our data analysis task and answer our research question, additional functionalities outside what is included in Python by default are required. We import the following useful packages using their common shortened names (i.e., patsy, NumPy, pandas, seaborn, etc.).
# Import packages
import patsy
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
# Import packages
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import PoissonRegressor
from sklearn.svm import SVR
# Statistical metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
# pandas time utilities
from pandas import to_datetime
from pandas import DatetimeIndex
from pandas import Period
# Configure libraries
sns.set(context = 'talk', style='white')
The script below fetches the dataset needed to run this Jupyter notebook. Since we have a CSV file ready on GitHub for our spending-crime data, we can directly import the data file using the URL. We import our dataset into a DataFrame called df.
# Import the data
df = pd.read_csv('https://raw.githubusercontent.com/RecluseHermit/COGS-108-Group-31-Data-Set/main/ArcGIS-Spendings-Crime-Data.csv')
Since our research question focuses on whether there is a relationship between health, education, and home security spending and the different categories of crime rate across all 535 valid FIPS areas in San Diego County, we are only interested in variables related to health, education, home security, and crime. Thus, the first step of data cleaning is to keep only the information about these four significant variables. As a result, we remove the other irrelevant columns, including county, state, ID, country code, etc.
# Drop useless columns
df = df.drop(columns = ['OBJECTID', 'FIPS', 'SQMI',
'COUNTY', 'STATE', 'Id', 'Country code', 'ENRICH_FID',
'Aggregation method', 'Population to polygon size rating for the country',
'Apportionment confidence for the country', 'Has data',
'Invalid1', 'Invalid2'])
After removing the irrelevant columns, we observe that some rows have missing values in the POP2020 column. Since no population is identified in those areas, it is not meaningful to include them along with their corresponding spending information (health, education, and home security). To prevent them from introducing bias and outliers into our analysis, we remove all rows with missing values. After that, we no longer need the POP2020 column, so we drop it.
# Drop NaN
df = df.dropna(subset = ['POP2020'])
df = df.drop(columns = ['POP2020'])
Next, we look at the remaining columns. Since some of the variable names are rather long compared to others, we rename all of them into a more standard form. We create a function called col_clean (taking a string as its input parameter) that standardizes the messy column titles. After applying it, our new column names are in lowercase letters with underscores separating the words.
# Title Cleaning
def col_clean(str_in):
    str_in = str_in.lower()
    str_in = str_in.strip()
    # since all our data is from 2021, we remove the years from the titles
    str_in = str_in.replace('2021', '')
    str_in = str_in.replace('2020', '')
    # remove ':' from the spending and crime column names
    str_in = str_in.replace(':', '')
    str_in = str_in.strip()
    # convert the title to snake_case
    str_in = str_in.replace(' ', '_')
    str_in = str_in.strip()
    # return the cleaned title
    return str_in
# Apply Cleaning
new_columns = df.columns
new_columns_name = []
# create a new list holding the cleaned titles
for title in new_columns:
    new_columns_name.append(col_clean(title))
# assign the new titles
df.columns = new_columns_name
We check our dataframe again.
# Check the Dataframe
df.head()
| health_care | avg_health_care | index_health_care | education | avg_education | index_education | avg_home_security_system_svcs | index_home_security_system_svcs | home_security_system_svcs | total_crime_aggregate | ... | total_crime_index | personal_crime_index | murder_index | rape_index | robbery_index | assault_index | property_crime_index | burglary_index | larceny_index | motor_vehicle_theft_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17324308.0 | 12947.91 | 208.0 | 5712349.0 | 4269.32 | 247.0 | 97.11 | 227.0 | 129939.0 | 274843.0 | ... | 91.0 | 51.0 | 26.0 | 21.0 | 75.0 | 46.0 | 98.0 | 66.0 | 98.0 | 157.0 |
| 1 | 9111558.0 | 8603.93 | 138.0 | 2625055.0 | 2478.81 | 144.0 | 58.18 | 136.0 | 61609.0 | 258390.0 | ... | 135.0 | 74.0 | 11.0 | 89.0 | 101.0 | 63.0 | 145.0 | 80.0 | 154.0 | 206.0 |
| 2 | 14528683.0 | 6518.03 | 105.0 | 4635862.0 | 2079.79 | 121.0 | 40.67 | 95.0 | 90644.0 | 310270.0 | ... | 74.0 | 32.0 | 4.0 | 96.0 | 13.0 | 29.0 | 82.0 | 80.0 | 83.0 | 72.0 |
| 3 | 14323984.0 | 6131.84 | 98.0 | 5232461.0 | 2239.92 | 130.0 | 34.02 | 79.0 | 79464.0 | 584307.0 | ... | 146.0 | 88.0 | 12.0 | 87.0 | 110.0 | 81.0 | 155.0 | 92.0 | 169.0 | 167.0 |
| 4 | 11593958.0 | 6748.52 | 108.0 | 4449809.0 | 2590.11 | 150.0 | 37.90 | 89.0 | 65108.0 | 270318.0 | ... | 95.0 | 38.0 | 4.0 | 91.0 | 23.0 | 36.0 | 104.0 | 92.0 | 101.0 | 153.0 |
5 rows × 29 columns
We want an overview of each variable in the df dataset, so we use the describe method to get descriptive statistics for all variables.
# Summarize the data in the dataset
df.describe()
| health_care | avg_health_care | index_health_care | education | avg_education | index_education | avg_home_security_system_svcs | index_home_security_system_svcs | home_security_system_svcs | total_crime_aggregate | ... | total_crime_index | personal_crime_index | murder_index | rape_index | robbery_index | assault_index | property_crime_index | burglary_index | larceny_index | motor_vehicle_theft_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.350000e+02 | 535.000000 | 535.000000 | 5.350000e+02 | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 5.350000e+02 | ... | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 535.000000 | 535.000000 |
| mean | 1.069791e+07 | 6649.135944 | 106.620561 | 3.495777e+06 | 2182.223570 | 126.452336 | 45.782168 | 106.979439 | 73503.306542 | 3.483117e+05 | ... | 77.330841 | 88.708411 | 51.226168 | 80.600000 | 93.085981 | 89.071028 | 75.465421 | 72.439252 | 68.042991 | 137.816822 |
| std | 6.197890e+06 | 2880.753515 | 46.163769 | 2.122106e+06 | 1020.589375 | 59.131013 | 23.917863 | 55.881408 | 47656.958581 | 2.265167e+05 | ... | 44.592127 | 73.073106 | 55.874888 | 61.957071 | 96.885566 | 77.619417 | 42.839503 | 31.992560 | 46.754679 | 105.360857 |
| min | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.120000e+02 | ... | 7.000000 | 6.000000 | 3.000000 | 5.000000 | 2.000000 | 6.000000 | 7.000000 | 4.000000 | 5.000000 | 4.000000 |
| 25% | 6.360818e+06 | 4513.035000 | 72.000000 | 2.008627e+06 | 1442.630000 | 83.500000 | 28.990000 | 68.000000 | 40684.000000 | 1.845465e+05 | ... | 46.000000 | 31.500000 | 12.000000 | 37.000000 | 19.000000 | 28.000000 | 46.000000 | 50.000000 | 35.000000 | 53.000000 |
| 50% | 9.201066e+06 | 6071.890000 | 97.000000 | 2.949414e+06 | 1991.820000 | 115.000000 | 40.130000 | 94.000000 | 63956.000000 | 3.122450e+05 | ... | 71.000000 | 67.000000 | 30.000000 | 66.000000 | 60.000000 | 69.000000 | 69.000000 | 68.000000 | 58.000000 | 122.000000 |
| 75% | 1.433878e+07 | 8109.760000 | 130.000000 | 4.627200e+06 | 2662.085000 | 154.500000 | 58.090000 | 136.000000 | 94893.000000 | 4.568370e+05 | ... | 100.000000 | 125.000000 | 70.000000 | 111.000000 | 142.000000 | 128.500000 | 97.500000 | 92.500000 | 87.500000 | 190.500000 |
| max | 4.263254e+07 | 16781.800000 | 269.000000 | 1.653481e+07 | 6508.730000 | 377.000000 | 132.480000 | 310.000000 | 336554.000000 | 1.569585e+06 | ... | 296.000000 | 366.000000 | 278.000000 | 547.000000 | 472.000000 | 389.000000 | 293.000000 | 191.000000 | 344.000000 | 522.000000 |
8 rows × 29 columns
After briefly cleaning the dataset, we plot the overall distribution of each variable of interest. By plotting histograms for average health care spending, average education spending, average home security system services spending, and total crime aggregate, we get a better visual idea of the distributions. The graphs below show that all four distributions are approximately normal, though slightly right-skewed.
# check the distributions and see if there are any outliers
plt.rcParams['figure.figsize'] = [20, 20]
f, axes = plt.subplots(2,2)
# check the distribution of health care average
sns.histplot(data = df, x = "avg_health_care", stat = "count", ax = axes[0][0])
# check the distribution of education average
sns.histplot(data = df, x = "avg_education", stat = "count", color = '#b3cde3', ax=axes[0][1])
# check the distribution of home security system svcs average
sns.histplot(data = df, x = "avg_home_security_system_svcs", stat = "count", color = '#88419d', ax = axes[1][0])
# check the distribution of total crime aggregate
sns.histplot(data = df, x = "total_crime_aggregate", stat = "count", color = '#cbc9e2', ax = axes[1][1])
# show plot
plt.show()
As we can see from the graphs above, there are some outliers at the far right end of all the histograms. We define a value as an outlier if it is at least two standard deviations away from the mean. While outliers can sometimes be very informative about the data collection process, they can also increase the variability in our data, which decreases statistical power. Furthermore, removing outliers reduces the least-squares error in our regression analysis, making our results more stable. For these reasons, we identify the outliers of avg_health_care, avg_education, avg_home_security_system_svcs, and total_crime_aggregate and remove them from the dataset. For crime, we filter on the aggregate because the dataset lacks a per-household average crime column.
# From the graph, it seems that there are some outliers on the right of the distributions.
# We define a value as an outlier if it is more than 2 standard deviations from the mean, and remove such rows.
for col in ['avg_health_care', 'avg_education',
            'avg_home_security_system_svcs', 'total_crime_aggregate']:
    df = df[df[col] <= df[col].mean() + 2 * np.std(df[col])]
    df = df[df[col] >= df[col].mean() - 2 * np.std(df[col])]
# graph the aftermath
plt.rcParams['figure.figsize'] = [20, 20]
f, axes = plt.subplots(2,2)
# check the distribution of health care
sns.histplot(data = df, x = "health_care", stat = "count", ax = axes[0][0])
# check the distribution of education
sns.histplot(data = df, x = "education", stat = "count", color = '#b3cde3', ax=axes[0][1])
# check the distribution of home security system svcs
sns.histplot(data = df, x = "home_security_system_svcs", stat = "count", color = '#88419d', ax = axes[1][0])
# check the distribution of total crime aggregate
sns.histplot(data = df, x = "total_crime_aggregate", stat = "count", color = '#cbc9e2', ax = axes[1][1])
# show plot
plt.show()
After removing the outliers, we standardize the data, which puts the different variables on the same scale and allows further analysis. Following the z-score formula, we transform each value by subtracting the mean and then dividing by the standard deviation. Finally, we assign these new values to t_standardized_data and save them for future use.
# From the graphs, we could see that the distribution of health care, education, home security system svcs,
# and total crime aggregate are approximately normal, though a bit right skewed. Besides, the counts are more
# than 500. Thus, we standardize the health care, education, home security system svcs, and total crime aggregate.
def standardized_units(data):
    # z-score: subtract the mean, then divide by the standard deviation
    mean_value = data.mean()
    std_value = np.std(data)
    lst = []
    for each_data in data:
        lst.append((each_data - mean_value) / std_value)
    return lst
# apply the standardization
t_standardized_data = pd.DataFrame().assign(
t_standardized_healthcare = standardized_units(df.get("health_care")),
t_standardized_education = standardized_units(df.get("education")),
t_standardized_home_security = standardized_units(df.get("home_security_system_svcs")),
t_standardized_crime = standardized_units(df.get("total_crime_aggregate")),
)
# check the standardized data
t_standardized_data.head()
| t_standardized_healthcare | t_standardized_education | t_standardized_home_security | t_standardized_crime | |
|---|---|---|---|---|
| 0 | -0.112534 | -0.308690 | -0.078097 | -0.482716 |
| 1 | 0.999064 | 0.947552 | 0.745534 | -0.182040 |
| 2 | 0.957060 | 1.320275 | 0.428393 | 1.406172 |
| 3 | 0.396857 | 0.831317 | 0.021159 | -0.413586 |
| 4 | 0.131979 | 0.541997 | -0.328320 | -0.073992 |
Since we already cleaned our dataset by removing all irrelevant columns, missing values, and outliers as well as performing the z-score standardization in the Data Cleaning section, now we can finally start analyzing our data.
To better visualize our cleaned data on health care spending, education spending, home security system service spending, and total crime aggregate, we use the sns.histplot command to plot the overall distribution of each.
However, as we observed in the previous four histograms, the range of the x-axis is extremely large (roughly 0 to 25,000,000 dollars for health care spending; 0 to 8,000,000 dollars for education spending; 0 to 175,000 dollars for home security system spending; and 0 to 800,000 for total crime aggregate).
Therefore, to make the histograms more reader-friendly, we decide to rescale the x-axis by changing the units. For the health_care_standardized histogram, we rescale the x-axis with units in 1 million dollars. For the education_standardized histogram, we rescale the x-axis with units in 1,000 dollars. For the home_security_standardized histogram, we rescale the x-axis with units in 1,000 dollars. For the aggregate_crime_standardized histogram, we rescale the x-axis with units in 1,000.
After plotting the four histograms, we can observe that they are now much more readable than before the unit change. Meanwhile, since we are using the same dataset (only with a change of units), the graphs preserve the same approximately normal distributions. From a statistical point of view, approximate normality is plausible here: each area-level figure aggregates many individual household expenditures, and by the central limit theorem, sums and means of many independent values tend toward a normal distribution as the sample size grows. (Strictly speaking, the theorem concerns sums and sample means rather than arbitrary variables, so this is a heuristic justification rather than a guarantee.)
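The central limit theorem argument above can be illustrated with a quick simulation. This is an illustrative sketch, not part of the original analysis: it draws from a hypothetical right-skewed (exponential) population, loosely mimicking spending data, and shows that means of samples of about 500 values (roughly the number of areas we have) are far less skewed than the population itself.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# A right-skewed population, similar in shape to household spending data.
population = rng.exponential(scale=1000, size=100_000)

# Means of 2,000 samples of size 500 each.
sample_means = rng.exponential(scale=1000, size=(2_000, 500)).mean(axis=1)

print(f"population skewness:  {skew(population):.2f}")    # strongly skewed
print(f"sample-mean skewness: {skew(sample_means):.2f}")  # close to 0
```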
# Plot distribution graph
# Assign standardized value to the dataframe
plt.rcParams['figure.figsize'] = [10, 10]
df = df.assign(
health_care_standardized = df.get("health_care") / 1000000,
education_standardized = df.get("education") / 1000,
home_security_standardized = df.get("home_security_system_svcs") / 1000,
aggregate_crime_standardized = df.get("total_crime_aggregate") / 1000
)
# Plot healthcare distribution graph
fig_healthcare = sns.histplot(data = df, x = "health_care_standardized", stat = "count", bins = 20)
fig_healthcare.set_xlabel("health care with units 1M")
fig_healthcare.set(title = "Histogram of Health Care")
plt.show()
# Plot education distribution graph
plt.rcParams['figure.figsize'] = [10, 10]
fig_education = sns.histplot(data = df, x = "education_standardized", stat = "count", color = '#b3cde3')
fig_education.set_xlabel("education with units 1k")
fig_education.set(title = "Histogram of Education")
plt.show()
# Plot homesecurity distribution graph
plt.rcParams['figure.figsize'] = [10, 10]
fig_homesecurity = sns.histplot(data = df, x = "home_security_standardized", stat = "count",color = '#88419d')
fig_homesecurity.set_xlabel("home_security with units 1k")
fig_homesecurity.set(title = "Histogram of Home Security System")
plt.show()
# Plot aggregate crime distribution graph
plt.rcParams['figure.figsize'] = [10, 10]
fig_aggregate_crime = sns.histplot(data = df, x = "aggregate_crime_standardized", stat = "count", color = '#cbc9e2')
fig_aggregate_crime.set_xlabel("aggregate_crime with units 1k")
fig_aggregate_crime.set(title = "Histogram of Aggregate Crime")
plt.show()
While the above figures provide us with a brief visualization of our dataset by displaying the overall distributions of the major four variables of interest (health care spending, education spending, home security system service spending, total crime aggregate), we still need further analysis to investigate the relationship between them.
As we learned in class, scatterplots are best suited to determining whether two variables are related or correlated. Moreover, since all of our values are on ratio scales, we can pick two variables at a time and plot them on a scatter diagram to view their relationship. Recall that our research question focuses on whether there is a relationship between health, education, and home security system service spending and the different categories of crime rate across all 535 valid FIPS areas in San Diego County. That is, we are mostly interested in the correlation between each of the three spending variables and the crime rate.
Thus, in the below script, we apply the sns.regplot command to generate three scatterplots (each plot with a corresponding least-squared regression line) that indicate the relationship between each factor and the total crime aggregate: health care spending vs. total crime aggregate, education spending vs. total crime aggregate, and home security system spending vs. total crime aggregate.
# Scatterplot of the aggregate values
plt.rcParams['figure.figsize'] = [14, 17]
# Extract the columns that we want to analyze
overall_total = df[['health_care', 'education', 'home_security_system_svcs', 'total_crime_aggregate']]
f, axes = plt.subplots(3,1)
f.suptitle('Relationship between each factors and total crime aggregate', fontsize=20)
factors_total = ['health_care', 'education', 'home_security_system_svcs']
# Plotting
sns.regplot(y='total_crime_aggregate', x=factors_total[0], data=overall_total, ax=axes[0])
sns.regplot(y='total_crime_aggregate', x=factors_total[1], data=overall_total, ax=axes[1], color = '#b3cde3')
sns.regplot(y='total_crime_aggregate', x=factors_total[2], data=overall_total, ax=axes[2], color = '#88419d')
plt.show()
Taking a closer look at each scatterplot, although the points appear random, the regression lines reveal a slight negative slope between each of the three factors and the total crime aggregate. This implies that the more an area in San Diego County (by FIPS code) spends on health care, education, and home security systems, the lower its total crime aggregate. This discovery also makes sense intuitively: in reality, crime is much more prevalent in poor, disadvantaged neighborhoods (with less spending on health care, education, and home security) than in wealthy and middle-class neighborhoods (with more such spending). With this background knowledge, the conclusion drawn from the scatterplots above is plausible.
From the three OLS regression models, we find that although all three graphs indicate a negative slope, the R-squared values for all three models are very small.
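The small R-squared values are not computed in this section, so as a rough illustration, here is a minimal, self-contained sketch of how a per-factor R-squared could be obtained with the already-imported scikit-learn tools. The data below are synthetic stand-ins (hypothetical values loosely mimicking the real columns), not the actual ArcGIS dataset.

```python
# Sketch: per-factor slope and R^2 for simple OLS fits, on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500  # roughly the number of FIPS areas after cleaning

# Hypothetical spending columns; only health_care drives the crime variable,
# and even that link is buried under heavy noise (so R^2 stays small).
spending = {
    'health_care': rng.normal(10_000_000, 3_000_000, n),
    'education': rng.normal(3_500_000, 1_000_000, n),
    'home_security_system_svcs': rng.normal(73_000, 24_000, n),
}
crime = 350_000 - 0.02 * spending['health_care'] + rng.normal(0, 150_000, n)

results = {}
for name, x in spending.items():
    X = x.reshape(-1, 1)                   # sklearn expects a 2-D feature array
    fit = LinearRegression().fit(X, crime)
    results[name] = (fit.coef_[0], fit.score(X, crime))
    print(f"{name}: slope = {fit.coef_[0]:.4f}, R^2 = {fit.score(X, crime):.3f}")
```

Even with a genuine negative slope, a noisy relationship yields a small R-squared, which matches the pattern described above.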
To extract more evidence that supports our previous claim, we would like to further investigate the correlation between each of the three factors and the total crime aggregate. This time, we use the standardized data from the t_standardized_data created in the Data Cleaning section.
Similar to the linear pattern in the scatterplots on the "unstandardized" data above, we again observe a slight negative slope between each of the three factors and the standardized crime variable. Looking more carefully at each scatterplot, the regression lines show the same pattern as before: on fully standardized data, the OLS slope equals the Pearson correlation coefficient, so each line is simply a rescaled version of its unstandardized counterpart.
# Plot after standardized data
plt.rcParams['figure.figsize'] = [14, 17]
# Extract the columns that we want to analyze
f, axes = plt.subplots(3,1)
f.suptitle('Relationship between each factor and total crime aggregate',
           fontsize=20)
factors_standardized = ['t_standardized_healthcare', 't_standardized_education', 't_standardized_home_security']
# Plotting
sns.regplot(y='t_standardized_crime', x=factors_standardized[0], data=t_standardized_data, ax=axes[0])
sns.regplot(y='t_standardized_crime', x=factors_standardized[1], data=t_standardized_data, ax=axes[1], color = '#b3cde3')
sns.regplot(y='t_standardized_crime', x=factors_standardized[2], data=t_standardized_data, ax=axes[2], color = '#88419d')
plt.show()
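The rescaling claim above can be checked on synthetic data: when both variables are z-scored, the simple OLS slope equals the Pearson correlation, and the unstandardized slope is that correlation times sd(y)/sd(x). This is a minimal numpy sketch with made-up data, not the project dataset.

```python
import numpy as np

# Synthetic data with a weak negative relationship (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = -0.2 * x + rng.normal(size=200)

# Unstandardized OLS slope: cov(x, y) / var(x)
b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Standardize (z-score) both variables, then refit
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
b_z = np.cov(zx, zy)[0, 1] / np.var(zx, ddof=1)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation

# Standardized slope equals r; unstandardized slope is r * sd(y)/sd(x)
assert np.isclose(b_z, r)
assert np.isclose(b, r * y.std(ddof=1) / x.std(ddof=1))
```

So standardizing the columns changes the numbers on the axes but not the shape of the fitted line.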
Again, from the three OLS regression models, we find that although all three fits indicate a negative slope, the R-squared values for all three models are very small.
Next, we further investigate the correlation between each of the three factors and total crime using the index values provided in the dataset. The index for each variable expresses the value relative to the entire population; for example, the overall crime index is based on the crime rate per some fixed population size (e.g. per 10,000 residents) for all crimes in a specific area. Compared to the linear pattern in the scatterplots above, we now observe an even stronger negative correlation between each of the three factors and the total crime index, meaning that a one-unit increase in the health care, education, or home security index is associated with a decrease in the total crime index. Plotting the index values thus makes the relationship more distinct.
plt.rcParams['figure.figsize'] = [14, 17]
# Extract the columns that we want to analyze
overall = df[['index_health_care', 'index_education', 'index_home_security_system_svcs', 'total_crime_index']]
f, axes = plt.subplots(3,1)
f.suptitle('Relationship between each factor and total crime aggregate',
           fontsize=20)
factors_index = ['index_health_care', 'index_education', 'index_home_security_system_svcs']
sns.regplot(y='total_crime_index', x=factors_index[0], data=overall, ax=axes[0])
sns.regplot(y='total_crime_index', x=factors_index[1], data=overall, ax=axes[1], color = '#b3cde3')
sns.regplot(y='total_crime_index', x=factors_index[2], data=overall, ax=axes[2], color = '#88419d')
plt.show()
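The exact formula behind the dataset's index columns is not documented here, so as a purely hypothetical illustration, assume the common convention in which an index of 100 means "equal to the region-wide average rate." An area's crime index could then be computed like this (all numbers made up):

```python
# Hypothetical index construction (assumed convention, not the dataset's
# documented formula): rate per 10,000 residents, scaled so that the
# region-wide average rate maps to an index of 100.
populations = [10_000, 25_000, 5_000]  # three made-up areas
crimes = [120, 250, 90]                # total crimes in each area

rates = [c / p * 10_000 for c, p in zip(crimes, populations)]
regional_rate = sum(crimes) / sum(populations) * 10_000
indices = [r / regional_rate * 100 for r in rates]

print([round(i, 1) for i in indices])  # [104.3, 87.0, 156.5]
```

Under this convention, an index above 100 marks an area with a higher-than-average crime rate; the spending indices would follow the same relative-to-average pattern.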
We also fit three OLS regressions: health care index vs. total crime index, education index vs. total crime index, and home security index vs. total crime index. We find that the slopes are more negative and the R-squared values are higher than in the previous models, which suggests that the index-based models are more reliable than the previous two sets.
The above nine scatterplots all display a consistently negative (though weak) correlation between each of the three factors (health care, education, and home security system service spending) and the crime rate. As a result, it is reasonable to conclude that the more an area within the valid FSIP regions of San Diego County spends on health care, education, and home security systems, the lower its total crime aggregate.
While we have reached the general conclusion stated above, it is worth investigating further how each of the three factors (health care, education, and home security system service spending) contributes to the nine different crime aggregates (i.e. murder, personal crime, rape, robbery, assault, property crime, burglary, larceny, and motor vehicle theft). To accomplish this, we will use line plots.
# Implement the function used to snap each data point into an interval
def split_crime(data):
    mean_value = data.sum() / len(data)
    std_value = np.std(data)
    std_value_divided_4 = std_value / 4  # bucket width: a quarter of a standard deviation
    lst = []
    for each in data:
        temp = each - mean_value
        if temp < 0:
            temp += 1
        # Snap the value to the center of its bucket, then shift back by the mean
        lst.append((int(temp / std_value_divided_4) + 0.5) * std_value_divided_4 + mean_value)
    return lst
Line charts are best when we want to show how a value changes over time, or to compare how several quantities change relative to each other. Since we are investigating how the nine different crime aggregates vary with health care, education, and home security system spending, we can take advantage of these properties: we set each spending factor as the x-axis and the crime aggregates as the y-axis, plotting nine lines in different colors (each representing a specific type of crime) on the same chart.
However, we encounter a problem when plotting the line chart. Our relatively large dataset contains a great many unique spending values, which serve as the x values. If we plot the original data directly, the chart looks almost like an abnormal ECG, with more than 500 data points connected together; it is so full of detail that the overall trend of each line cannot be perceived at all.
To solve this problem, we write a function called split_crime that rounds the x values into buckets and thereby reduces the number of distinct x values. More simply, we merge x values that are close to each other so that they share the same plotted point. To achieve this, each x value is replaced as follows: subtract the mean, divide by a quarter of the standard deviation, truncate to an integer, add 0.5, multiply back by a quarter of the standard deviation, and finally add the mean.
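The effect of this bucketing can be seen on synthetic data. This is a simplified sketch (it uses floor rather than the function's int() truncation, and made-up values rather than the project dataset): hundreds of distinct x values collapse into a few dozen bucket centers.

```python
import numpy as np

# Toy data standing in for a spending column (not the project dataset)
rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=8, size=500)

q = values.std() / 4  # bucket width: a quarter of a standard deviation
# Snap each value to the center of its bucket (floor used for simplicity)
binned = (np.floor((values - values.mean()) / q) + 0.5) * q + values.mean()

print(len(np.unique(values)), "->", len(np.unique(binned)))
```

After binning, consecutive points on the x-axis aggregate many nearby observations, which is what smooths the line chart.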
# Add a rounded-down version of each spending and index column to the dataframe
for col in ["health_care", "education", "home_security_system_svcs",
            "index_health_care", "index_education",
            "index_home_security_system_svcs"]:
    df = df.assign(**{col + "_roundown": split_crime(df.get(col))})
# Implement the function which draws the line plots of the nine crime aggregates against a spending column
def draw_graph_with_aggregate(spending):
    plt.rcParams['figure.figsize'] = [10, 10]
    # One line per crime type; keep a handle on the first plot to set the y label
    p = sns.lineplot(x=spending, y="murder_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="personal_crime_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="rape_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="robbery_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="assault_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="property_crime_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="burglary_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="larceny_aggregate", data=df, ci=None)
    sns.lineplot(x=spending, y="motor_vehicle_theft_aggregate", data=df, ci=None)
    p.set_ylabel("9 crimes aggregate", fontsize=20)
    plt.legend(labels=["murder", "personal crime", "rape", "robbery", "assault",
                       "property crime", "burglary", "larceny", "motor vehicle theft"],
               loc="upper right")
    plt.show()
# Health care spending vs. the nine crime aggregates
draw_graph_with_aggregate("health_care_roundown")
# Education spending vs. the nine crime aggregates
draw_graph_with_aggregate("education_roundown")
# Home security spending vs. the nine crime aggregates
draw_graph_with_aggregate("home_security_system_svcs_roundown")
After rescaling the x values, we use sns.lineplot to plot the line chart for each of the three factors (health care, education, and home security system service spending) against the nine different crime aggregates. All three charts share similar patterns: crimes including motor vehicle theft, robbery, assault, personal crime, and murder all reach their peak aggregates at roughly the same place (around $6,000,000 of health care spending, around $2,000,000 of education spending, and around $27,000 of home security system service spending) and then gradually decrease as spending increases. This finding matches the conclusion drawn from the scatterplots: the more an area within the valid FSIP regions of San Diego County spends on health care, education, and home security systems, the lower its total crime aggregate.
To further investigate the impact of each of the three factors (health care, education, and home security system service spending) on the nine different crime aggregates, we perform the same visual analysis (i.e. plotting line charts) using the nine crime index values. To match the crime indices, we use the health care index, the education index, and the home security system service index. This time, we find a much stronger decreasing trend between our factors and the crime indices. As in the previous three graphs, motor vehicle theft yields the highest crime index in all three charts, and murder yields the lowest. Overall, as health care, education, and home security system spending increase, all nine crime indices decrease, albeit with a gradually slowing decline.
# Implement the function which draws the line plots of the nine crime indices against a spending index
def draw_graph_with_index(spending):
    plt.rcParams['figure.figsize'] = [10, 10]
    # One line per crime type; keep a handle on the first plot to set the y label
    p4 = sns.lineplot(x=spending, y="murder_index", data=df, ci=None)
    sns.lineplot(x=spending, y="personal_crime_index", data=df, ci=None)
    sns.lineplot(x=spending, y="rape_index", data=df, ci=None)
    sns.lineplot(x=spending, y="robbery_index", data=df, ci=None)
    sns.lineplot(x=spending, y="assault_index", data=df, ci=None)
    sns.lineplot(x=spending, y="property_crime_index", data=df, ci=None)
    sns.lineplot(x=spending, y="burglary_index", data=df, ci=None)
    sns.lineplot(x=spending, y="larceny_index", data=df, ci=None)
    sns.lineplot(x=spending, y="motor_vehicle_theft_index", data=df, ci=None)
    p4.set_ylabel("9 crimes index", fontsize=20)
    plt.legend(labels=["murder", "personal crime", "rape", "robbery", "assault",
                       "property crime", "burglary", "larceny", "motor vehicle theft"],
               loc="upper right")
    plt.show()
# Health care index vs. the nine crime indices
draw_graph_with_index("index_health_care_roundown")
# Education index vs. the nine crime indices
draw_graph_with_index("index_education_roundown")
# Home security index vs. the nine crime indices
draw_graph_with_index("index_home_security_system_svcs_roundown")
Other than drawing the scatterplots, we also apply ordinary least squares regression to compute the slope and p-value mathematically. We set the significance level at 5% and take as the null hypothesis that the slope is 0. In other words, if the p-value is less than 5%, we have sufficient evidence to reject the null hypothesis and claim a correlation between the dependent and independent variables. Since we are interested in whether education, health care, and home security spending are correlated with the crime rate, we first test whether each variable individually is correlated with it.
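The test just described can be sketched with numpy alone on synthetic data (the actual analysis below uses patsy and statsmodels): fit the line by least squares, form the t statistic for the slope, and compare it against the large-sample 5% two-sided critical value of about 1.96. All names and numbers here are illustrative, not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 441                               # same sample size as our data
x = rng.normal(size=n)
y = -0.3 * x + rng.normal(size=n)     # a clearly negative true slope

X = np.column_stack([np.ones(n), x])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)      # residual variance estimate
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta[1] / se_slope

reject_null = abs(t_stat) > 1.96      # reject "slope = 0" at the 5% level
print(beta[1], t_stat, reject_null)
```

statsmodels performs the same computation and additionally reports the exact p-value from the t distribution, which is what we read off the summaries below.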
# Get the intercept and slope of the regression line between each factor and the standardized crime aggregate
for i in range(len(factors_standardized)):
    string = "t_standardized_crime ~ " + factors_standardized[i]
    print(string)
    dependent, predictor = patsy.dmatrices(string, t_standardized_data)
    model = sm.OLS(dependent, predictor)
    res_1 = model.fit()
    print(res_1.summary())
t_standardized_crime ~ t_standardized_healthcare
OLS Regression Results
================================================================================
Dep. Variable: t_standardized_crime R-squared: 0.027
Model: OLS Adj. R-squared: 0.025
Method: Least Squares F-statistic: 12.17
Date: Mon, 14 Mar 2022 Prob (F-statistic): 0.000536
Time: 21:12:27 Log-Likelihood: -619.72
No. Observations: 441 AIC: 1243.
Df Residuals: 439 BIC: 1252.
Df Model: 1
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
Intercept 1.18e-16 0.047 2.51e-15 1.000 -0.093 0.093
t_standardized_healthcare -0.1642 0.047 -3.488 0.001 -0.257 -0.072
==============================================================================
Omnibus: 15.593 Durbin-Watson: 1.599
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12.137
Skew: 0.311 Prob(JB): 0.00231
Kurtosis: 2.478 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
t_standardized_crime ~ t_standardized_education
OLS Regression Results
================================================================================
Dep. Variable: t_standardized_crime R-squared: 0.028
Model: OLS Adj. R-squared: 0.026
Method: Least Squares F-statistic: 12.67
Date: Mon, 14 Mar 2022 Prob (F-statistic): 0.000412
Time: 21:12:27 Log-Likelihood: -619.48
No. Observations: 441 AIC: 1243.
Df Residuals: 439 BIC: 1251.
Df Model: 1
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 1.18e-16 0.047 2.51e-15 1.000 -0.092 0.092
t_standardized_education -0.1675 0.047 -3.560 0.000 -0.260 -0.075
==============================================================================
Omnibus: 15.454 Durbin-Watson: 1.593
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12.154
Skew: 0.314 Prob(JB): 0.00229
Kurtosis: 2.483 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
t_standardized_crime ~ t_standardized_home_security
OLS Regression Results
================================================================================
Dep. Variable: t_standardized_crime R-squared: 0.043
Model: OLS Adj. R-squared: 0.040
Method: Least Squares F-statistic: 19.51
Date: Mon, 14 Mar 2022 Prob (F-statistic): 1.26e-05
Time: 21:12:27 Log-Likelihood: -616.17
No. Observations: 441 AIC: 1236.
Df Residuals: 439 BIC: 1245.
Df Model: 1
Covariance Type: nonrobust
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.18e-16 0.047 2.53e-15 1.000 -0.092 0.092
t_standardized_home_security -0.2063 0.047 -4.417 0.000 -0.298 -0.114
==============================================================================
Omnibus: 14.172 Durbin-Watson: 1.635
Prob(Omnibus): 0.001 Jarque-Bera (JB): 11.711
Skew: 0.318 Prob(JB): 0.00286
Kurtosis: 2.517 Cond. No. 1.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since the p-value in each regression model is less than 5%, the models suggest that correlations exist between standardized crime and standardized health care, standardized education, and standardized home security respectively. From the summaries, the slopes of the three regression models range from about -0.16 to -0.21, indicating a negative correlation: a one-standard-deviation increase in the independent variable is associated with a decrease of roughly 17-21% of a standard deviation in the dependent variable. Thus, these models are statistically and economically significant. However, the R-squared of each model is around 0.03, meaning these models explain only about 3% of the variance. This makes sense because many factors contribute to the crime rate; predicting it from a single factor by linear regression is not plausible. In conclusion, although the models show slight negative correlations between standardized crime and standardized health care, education, and home security respectively, a linear regression model with a single independent variable is probably not the right model to predict the crime rate.
# Get the intercept and slope of the regression line between each factor and the total crime index
for i in range(len(factors_index)):
    string = "total_crime_index ~ " + factors_index[i]
    print(string)
    dependent, predictor = patsy.dmatrices(string, overall)
    model = sm.OLS(dependent, predictor)
    res_1 = model.fit()
    print(res_1.summary())
total_crime_index ~ index_health_care
OLS Regression Results
==============================================================================
Dep. Variable: total_crime_index R-squared: 0.201
Model: OLS Adj. R-squared: 0.199
Method: Least Squares F-statistic: 110.3
Date: Mon, 14 Mar 2022 Prob (F-statistic): 3.55e-23
Time: 21:12:27 Log-Likelihood: -2205.1
No. Observations: 441 AIC: 4414.
Df Residuals: 439 BIC: 4422.
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 135.6613 5.749 23.599 0.000 124.363 146.960
index_health_care -0.6060 0.058 -10.504 0.000 -0.719 -0.493
==============================================================================
Omnibus: 95.703 Durbin-Watson: 1.563
Prob(Omnibus): 0.000 Jarque-Bera (JB): 206.384
Skew: 1.136 Prob(JB): 1.53e-45
Kurtosis: 5.464 Cond. No. 334.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
total_crime_index ~ index_education
OLS Regression Results
==============================================================================
Dep. Variable: total_crime_index R-squared: 0.176
Model: OLS Adj. R-squared: 0.174
Method: Least Squares F-statistic: 93.48
Date: Mon, 14 Mar 2022 Prob (F-statistic): 3.55e-20
Time: 21:12:27 Log-Likelihood: -2212.0
No. Observations: 441 AIC: 4428.
Df Residuals: 439 BIC: 4436.
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 128.1977 5.474 23.420 0.000 117.440 138.956
index_education -0.4494 0.046 -9.669 0.000 -0.541 -0.358
==============================================================================
Omnibus: 94.983 Durbin-Watson: 1.514
Prob(Omnibus): 0.000 Jarque-Bera (JB): 204.346
Skew: 1.129 Prob(JB): 4.23e-45
Kurtosis: 5.454 Cond. No. 370.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
total_crime_index ~ index_home_security_system_svcs
OLS Regression Results
==============================================================================
Dep. Variable: total_crime_index R-squared: 0.232
Model: OLS Adj. R-squared: 0.230
Method: Least Squares F-statistic: 132.6
Date: Mon, 14 Mar 2022 Prob (F-statistic): 5.40e-27
Time: 21:12:27 Log-Likelihood: -2196.4
No. Observations: 441 AIC: 4397.
Df Residuals: 439 BIC: 4405.
Df Model: 1
Covariance Type: nonrobust
===================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------
Intercept 129.4922 4.775 27.118 0.000 120.107 138.877
index_home_security_system_svcs -0.5575 0.048 -11.516 0.000 -0.653 -0.462
==============================================================================
Omnibus: 98.934 Durbin-Watson: 1.634
Prob(Omnibus): 0.000 Jarque-Bera (JB): 217.651
Skew: 1.164 Prob(JB): 5.47e-48
Kurtosis: 5.535 Cond. No. 280.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We then run linear regressions of the total crime index on the health care index, education index, and home security index respectively. The p-values of all three slopes are approximately 0, so we can confidently reject the null hypothesis. The fitted slopes range from about -0.45 to -0.61, indicating weak negative correlations: a one-unit increase in the independent index is associated with roughly a half-unit decrease in the total crime index. Thus, these models are statistically and economically significant. Compared with the previous tests on the standardized aggregate data, the R-squared values are now around 20%, so building the model on index data appears more reasonable than on standardized data. We next pick two of the three variables (health care, education, home security) at a time and investigate whether any combination of two independent variables influences the crime rate.
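One idiomatic way to enumerate the variable pairs is itertools.combinations, which yields each unordered pair exactly once; this is an alternative sketch of the enumeration only (the code below instead cycles through the list with an (i+1) % len index, producing the same three pairs in a different order).

```python
from itertools import combinations

factors_standardized = ['t_standardized_healthcare',
                        't_standardized_education',
                        't_standardized_home_security']

# Build one regression formula per unordered pair of factors
formulas = ["t_standardized_crime ~ {} + {}".format(a, b)
            for a, b in combinations(factors_standardized, 2)]
for f in formulas:
    print(f)
```

Each formula string can then be passed to patsy.dmatrices and sm.OLS exactly as in the single-variable case.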
# Get the intercept and slopes of the regression line between each pair of factors and the standardized crime aggregate
for i in range(len(factors_standardized)):
    string = "t_standardized_crime ~ " + factors_standardized[i] + " + " + factors_standardized[(i+1) % len(factors_standardized)]
    print(string)
    dependent, predictor = patsy.dmatrices(string, t_standardized_data)
    model = sm.OLS(dependent, predictor)
    res_1 = model.fit()
    print(res_1.summary())
t_standardized_crime ~ t_standardized_healthcare + t_standardized_education
OLS Regression Results
================================================================================
Dep. Variable: t_standardized_crime R-squared: 0.028
Model: OLS Adj. R-squared: 0.024
Method: Least Squares F-statistic: 6.346
Date: Mon, 14 Mar 2022 Prob (F-statistic): 0.00192
Time: 21:12:27 Log-Likelihood: -619.45
No. Observations: 441 AIC: 1245.
Df Residuals: 438 BIC: 1257.
Df Model: 2
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
Intercept 1.18e-16 0.047 2.5e-15 1.000 -0.093 0.093
t_standardized_healthcare -0.0386 0.177 -0.218 0.828 -0.387 0.310
t_standardized_education -0.1303 0.177 -0.734 0.463 -0.479 0.219
==============================================================================
Omnibus: 15.429 Durbin-Watson: 1.596
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12.105
Skew: 0.313 Prob(JB): 0.00235
Kurtosis: 2.483 Cond. No. 7.40
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
t_standardized_crime ~ t_standardized_education + t_standardized_home_security
OLS Regression Results
================================================================================
Dep. Variable: t_standardized_crime R-squared: 0.046
Model: OLS Adj. R-squared: 0.042
Method: Least Squares F-statistic: 10.57
Date: Mon, 14 Mar 2022 Prob (F-statistic): 3.27e-05
Time: 21:12:27 Log-Likelihood: -615.35
No. Observations: 441 AIC: 1237.
Df Residuals: 438 BIC: 1249.
Df Model: 2
Covariance Type: nonrobust
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.18e-16 0.047 2.53e-15 1.000 -0.092 0.092
t_standardized_education 0.1540 0.121 1.271 0.204 -0.084 0.392
t_standardized_home_security -0.3484 0.121 -2.875 0.004 -0.587 -0.110
==============================================================================
Omnibus: 14.739 Durbin-Watson: 1.636
Prob(Omnibus): 0.001 Jarque-Bera (JB): 12.536
Skew: 0.338 Prob(JB): 0.00190
Kurtosis: 2.526 Cond. No. 4.99
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
t_standardized_crime ~ t_standardized_home_security + t_standardized_healthcare
OLS Regression Results
================================================================================
Dep. Variable: t_standardized_crime R-squared: 0.078
Model: OLS Adj. R-squared: 0.074
Method: Least Squares F-statistic: 18.63
Date: Mon, 14 Mar 2022 Prob (F-statistic): 1.72e-08
Time: 21:12:27 Log-Likelihood: -607.75
No. Observations: 441 AIC: 1222.
Df Residuals: 438 BIC: 1234.
Df Model: 2
Covariance Type: nonrobust
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.18e-16 0.046 2.57e-15 1.000 -0.090 0.090
t_standardized_home_security -1.1337 0.229 -4.944 0.000 -1.584 -0.683
t_standardized_healthcare 0.9466 0.229 4.128 0.000 0.496 1.397
==============================================================================
Omnibus: 17.404 Durbin-Watson: 1.626
Prob(Omnibus): 0.000 Jarque-Bera (JB): 17.683
Skew: 0.458 Prob(JB): 0.000145
Kurtosis: 2.649 Cond. No. 9.90
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In the first model, t_standardized_crime ~ t_standardized_healthcare + t_standardized_education, both slopes have p-values above 5% (0.828 and 0.463), so we cannot reject the null hypothesis for either variable. In the second model, t_standardized_crime ~ t_standardized_education + t_standardized_home_security, home security is significant (p = 0.004) while education is not (p = 0.204). In the last model, t_standardized_crime ~ t_standardized_home_security + t_standardized_healthcare, the p-values of both home security and health care are approximately 0, suggesting crime is jointly correlated with these two variables; the slope of home security is -1.1337 and the slope of health care is 0.9466. However, since the R-squared is 0.078, meaning only 7.8% of the variance is explained, this regression model could not predict real-world crime rates well.
# Get the intercept and slopes of the regression line between each pair of factors and the total crime index
for i in range(len(factors_index)):
    string = "total_crime_index ~ " + factors_index[i] + " + " + factors_index[(i+1) % len(factors_index)]
    print(string)
    dependent, predictor = patsy.dmatrices(string, overall)
    model = sm.OLS(dependent, predictor)
    res_1 = model.fit()
    print(res_1.summary())
total_crime_index ~ index_health_care + index_education
OLS Regression Results
==============================================================================
Dep. Variable: total_crime_index R-squared: 0.201
Model: OLS Adj. R-squared: 0.198
Method: Least Squares F-statistic: 55.16
Date: Mon, 14 Mar 2022 Prob (F-statistic): 4.30e-22
Time: 21:12:27 Log-Likelihood: -2205.0
No. Observations: 441 AIC: 4416.
Df Residuals: 438 BIC: 4428.
Df Model: 2
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 135.7375 5.757 23.579 0.000 124.423 147.052
index_health_care -0.5477 0.146 -3.750 0.000 -0.835 -0.261
index_education -0.0504 0.116 -0.435 0.664 -0.278 0.177
==============================================================================
Omnibus: 96.176 Durbin-Watson: 1.559
Prob(Omnibus): 0.000 Jarque-Bera (JB): 208.363
Skew: 1.139 Prob(JB): 5.68e-46
Kurtosis: 5.479 Cond. No. 517.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
total_crime_index ~ index_education + index_home_security_system_svcs
OLS Regression Results
==============================================================================
Dep. Variable: total_crime_index R-squared: 0.232
Model: OLS Adj. R-squared: 0.229
Method: Least Squares F-statistic: 66.33
Date: Mon, 14 Mar 2022 Prob (F-statistic): 6.87e-26
Time: 21:12:27 Log-Likelihood: -2196.2
No. Observations: 441 AIC: 4398.
Df Residuals: 438 BIC: 4411.
Df Model: 2
Covariance Type: nonrobust
===================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------
Intercept 130.6878 5.306 24.632 0.000 120.260 141.115
index_education -0.0437 0.084 -0.519 0.604 -0.209 0.122
index_home_security_system_svcs -0.5177 0.091 -5.698 0.000 -0.696 -0.339
==============================================================================
Omnibus: 99.479 Durbin-Watson: 1.628
Prob(Omnibus): 0.000 Jarque-Bera (JB): 220.399
Skew: 1.167 Prob(JB): 1.38e-48
Kurtosis: 5.559 Cond. No. 482.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
total_crime_index ~ index_home_security_system_svcs + index_health_care
OLS Regression Results
==============================================================================
Dep. Variable: total_crime_index R-squared: 0.237
Model: OLS Adj. R-squared: 0.233
Method: Least Squares F-statistic: 67.87
Date: Mon, 14 Mar 2022 Prob (F-statistic): 2.10e-26
Time: 21:12:27 Log-Likelihood: -2195.0
No. Observations: 441 AIC: 4396.
Df Residuals: 438 BIC: 4408.
Df Model: 2
Covariance Type: nonrobust
===================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------
Intercept 122.7855 6.303 19.481 0.000 110.398 135.173
index_home_security_system_svcs -0.8540 0.189 -4.528 0.000 -1.225 -0.483
index_health_care 0.3582 0.220 1.626 0.105 -0.075 0.791
==============================================================================
Omnibus: 99.102 Durbin-Watson: 1.661
Prob(Omnibus): 0.000 Jarque-Bera (JB): 216.152
Skew: 1.170 Prob(JB): 1.16e-47
Kurtosis: 5.508 Cond. No. 526.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We then fit the same pairwise regressions on the index data. In each regression, only one variable has a p-value near 0 while the other's p-value is larger than 5%: the health care index is significant when paired with the education index, and the home security index is significant when paired with either the education index or the health care index. Since no model has both variables significant at once, little additional conclusion can be drawn from this part.
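A plausible reason for this pattern, sketched here on synthetic data rather than verified on the project dataset, is multicollinearity: when two predictors are strongly correlated, the two-variable model cannot attribute the shared effect to either one, so individual p-values rise even though each predictor is significant alone. A quick first check is the predictors' pairwise correlation and the variance inflation factor it implies.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 441
base = rng.normal(size=n)             # shared component (e.g. area wealth)
x1 = base + 0.3 * rng.normal(size=n)  # two strongly correlated predictors
x2 = base + 0.3 * rng.normal(size=n)

r = np.corrcoef(x1, x2)[0, 1]         # pairwise correlation of the predictors
vif = 1 / (1 - r ** 2)                # variance inflation factor (two predictors)
print(round(r, 2), round(vif, 1))
```

In our setting, areas that spend more on health care plausibly also spend more on education and home security, which would produce exactly this kind of shared explanatory information.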
# Get the intercept and slopes of the regression line between all three spendings and the standardized crime aggregate
string = "t_standardized_crime ~ " + factors_standardized[0] + " + " + \
         factors_standardized[1] + " + " + factors_standardized[2]
print(string)
dependent, predictor = patsy.dmatrices(string, t_standardized_data)
model = sm.OLS(dependent, predictor)
res_1 = model.fit()
print(res_1.summary())
t_standardized_crime ~ t_standardized_healthcare + t_standardized_education + t_standardized_home_security
OLS Regression Results
================================================================================
Dep. Variable: t_standardized_crime R-squared: 0.098
Model: OLS Adj. R-squared: 0.092
Method: Least Squares F-statistic: 15.83
Date: Mon, 14 Mar 2022 Prob (F-statistic): 8.69e-10
Time: 21:12:27 Log-Likelihood: -603.01
No. Observations: 441 AIC: 1214.
Df Residuals: 437 BIC: 1230.
Df Model: 3
Covariance Type: nonrobust
================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------
Intercept 1.18e-16 0.045 2.6e-15 1.000 -0.089 0.089
t_standardized_healthcare 1.8137 0.362 5.017 0.000 1.103 2.524
t_standardized_education -0.5789 0.188 -3.083 0.002 -0.948 -0.210
t_standardized_home_security -1.4491 0.249 -5.817 0.000 -1.939 -0.960
==============================================================================
Omnibus: 17.264 Durbin-Watson: 1.624
Prob(Omnibus): 0.000 Jarque-Bera (JB): 18.509
Skew: 0.490 Prob(JB): 9.57e-05
Kurtosis: 2.781 Cond. No. 16.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
While the p-value of the intercept in the model t_standardized_crime ~ t_standardized_healthcare + t_standardized_education + t_standardized_home_security is equal to 1, this is expected for standardized data (the intercept is essentially 0) and does not affect the analysis. All of the other p-values are below 5%, so the model suggests a correlation between standardized crime and standardized healthcare, education, and home security. From the summary, the slope for t_standardized_healthcare is 1.8137, indicating a positive association between the crime rate and health spending, while the slopes for the other two variables (-0.5789 for education and -1.4491 for home security) indicate negative associations. Since the coefficients point in both directions, the model does not support a strong one-directional (negative) relationship between our variables. In addition, the R-squared of this model is 0.098, meaning it explains only about 9.8% of the variance in the data. In conclusion, this three-variable linear regression suggests at most a very weak relationship between our independent and dependent variables.
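The t_standardized_* variables above are presumably z-scores (which is why the fitted intercept is essentially 0). A minimal sketch of that standardization, assuming a pandas DataFrame with hypothetical column names:

```python
import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    # z-score: subtract the column mean, divide by the (sample) standard deviation
    return (s - s.mean()) / s.std()

# toy frame standing in for the raw spending columns
df = pd.DataFrame({"healthcare": [10.0, 12.0, 14.0],
                   "education": [5.0, 7.0, 9.0]})
standardized = df.apply(zscore)
# each standardized column now has mean 0 and sample std 1
```

Because every standardized predictor has mean 0, a regression on them is expected to have an intercept of (numerically) 0, matching the 1.18e-16 intercept in the summary.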
# get the intercept and slopes of the linear regression between the 3 spending indices and the total crime index
string = "total_crime_index ~ " + factors_index[0] + " + " +\
factors_index[1] + " + " + factors_index[2]
print(string)
dependent, predictor = patsy.dmatrices(string, overall)
model = sm.OLS(dependent, predictor)
res_1 = model.fit()
print(res_1.summary())
total_crime_index ~ index_health_care + index_education + index_home_security_system_svcs
OLS Regression Results
==============================================================================
Dep. Variable: total_crime_index R-squared: 0.248
Model: OLS Adj. R-squared: 0.243
Method: Least Squares F-statistic: 48.02
Date: Mon, 14 Mar 2022 Prob (F-statistic): 7.66e-27
Time: 21:12:27 Log-Likelihood: -2191.7
No. Observations: 441 AIC: 4391.
Df Residuals: 437 BIC: 4408.
Df Model: 3
Covariance Type: nonrobust
===================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------
Intercept 119.9621 6.359 18.865 0.000 107.464 132.460
index_health_care 0.9732 0.325 2.999 0.003 0.335 1.611
index_education -0.3174 0.124 -2.567 0.011 -0.560 -0.074
index_home_security_system_svcs -1.0730 0.206 -5.211 0.000 -1.478 -0.668
==============================================================================
Omnibus: 103.188 Durbin-Watson: 1.666
Prob(Omnibus): 0.000 Jarque-Bera (JB): 234.572
Skew: 1.198 Prob(JB): 1.16e-51
Kurtosis: 5.651 Cond. No. 696.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We then regress the crime rate index on index_health_care, index_education, and index_home_security_system_svcs. Compared to the model on the standardized variables, the R-squared (0.248 vs. 0.098) is about 15 percentage points higher, so the index model explains roughly 15 percentage points more of the variance in the data. The p-values for all three slopes are below 5%, so we can be reasonably confident in the estimated slopes. index_education and index_home_security_system_svcs have weak negative relationships with the crime index, while index_health_care has a weak positive one: on average, a one-unit increase in index_education decreases the crime index by 0.3174, a one-unit increase in index_home_security_system_svcs decreases it by 1.0730, and a one-unit increase in index_health_care increases it by 0.9732. Although the model is statistically significant, the weak, mixed-sign relationships between our independent and dependent variables indicate that there is unlikely to be a strong negative correlation between them.
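The per-unit interpretations above follow directly from the fitted coefficients. As a quick sanity check (coefficients copied from the OLS summary; the helper function is ours, not part of the notebook):

```python
# fitted OLS coefficients from the summary above
coef = {"index_health_care": 0.9732,
        "index_education": -0.3174,
        "index_home_security_system_svcs": -1.0730}

def predicted_change(d_health, d_edu, d_home):
    """Predicted change in total_crime_index for given changes in the three indices."""
    return (coef["index_health_care"] * d_health
            + coef["index_education"] * d_edu
            + coef["index_home_security_system_svcs"] * d_home)

# a one-unit rise in all three indices nets 0.9732 - 0.3174 - 1.0730
print(round(predicted_change(1, 1, 1), 4))  # -0.4172
```

The mixed signs mean simultaneous increases largely cancel, which is consistent with the weak overall relationship.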
Across all of the above linear regression models, we found both positive and negative coefficients, with slopes ranging from -1.4491 to 1.8137. Since all observed slopes are relatively close to 0, the models fail to suggest a strong one-directional relationship, and the R-squared values of all the linear models are very small. Linear regression alone is therefore not sufficient to explain the actual relationship between the variables of interest. For further analysis, we turn to scikit-learn to obtain more flexible predictions on our data; specifically, we will use linear regression, support vector regression with a radial basis function (RBF) kernel, support vector regression with a polynomial kernel, ridge regression, and Poisson regression to predict the crime rate.
In the following scripts, we first split the data into training and testing sets (80% and 20%, respectively). We then fit the different scikit-learn regression models, including linear regression, RBF support vector regression, polynomial support vector regression, ridge regression, and Poisson regression, on the training data. We plot three blocks of regression models (one for each independent variable: index_health_care, index_education, index_home_security_system_svcs), each block with plots on both the training set and the testing set.
# decide the proportion of training and testing data
svm_df = df[["index_health_care", "index_education", "index_home_security_system_svcs", "total_crime_index"]]
svm_df_test = pd.DataFrame()
svm_df_test = svm_df[(int(len(svm_df) * 0.8)):]
svm_df = svm_df[:(int(len(svm_df) * 0.8))]
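The slicing above takes the last 20% of rows in their stored order, so if the rows are sorted (for example by region), the test set may not be representative. scikit-learn's train_test_split shuffles by default; a minimal sketch, with a toy frame standing in for svm_df:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for svm_df; the real frame has the spending and crime columns
data = pd.DataFrame({"index_health_care": range(100),
                     "total_crime_index": range(100)})

# shuffled 80/20 split, reproducible via random_state
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(len(train), len(test))  # 80 20
```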
# implement the function which prints the accuracy of the test result
def get_score(clf, x_test, y_test, X, y):
    y_predict = clf.predict(x_test)
    scores = cross_val_score(clf, X, y, cv=10)
    print("Accuracy Report:")
    print("\t%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
    print("\tR2 Score: %f" % r2_score(y_test, y_predict))
    print("\tMean Square Error: %f" % mean_squared_error(y_test, y_predict))
# use sklearn's SVR and related estimators to fit the data with Linear Regression, RBF Regression,
# Polynomial Regression, Ridge Regression, and Poisson Regression; this function
# compares a single spending index with the total crime index
def one_one_compare(s):
X = svm_df['%s' % s].values.reshape(-1,1)
y = svm_df['total_crime_index'].values
model1 = LinearRegression()
lr = model1.fit(X, y)
model2 = SVR(kernel='rbf', C=10, epsilon=10)
svr_rbf = model2.fit(X, y)
model3 = SVR(kernel='poly', degree=2, C=10, epsilon=10)
svr_poly = model3.fit(X, y)
model4 = Ridge(alpha=1.0)
ridge = model4.fit(X, y)
model5 = PoissonRegressor(alpha=1.0)
poisson = model5.fit(X, y)
x_range = np.linspace(X.min(), X.max(), 100)
# evaluate each model over the training x-range
y_lr = model1.predict(x_range.reshape(-1, 1))
y_svr = model2.predict(x_range.reshape(-1, 1))
y_poly = model3.predict(x_range.reshape(-1, 1))
y_ridge = model4.predict(x_range.reshape(-1, 1))
y_poisson = model5.predict(x_range.reshape(-1, 1))
fig = px.scatter(df, x=svm_df['%s' % s], y=svm_df['total_crime_index'],
opacity=0.8, color_discrete_sequence=['black'])
# add the fitted regression curves to the scatter plot
fig.add_traces(go.Scatter(x=x_range, y=y_lr, name='Linear Regression', line=dict(color='green')))
fig.add_traces(go.Scatter(x=x_range, y=y_svr, name='Support Vector Regression - RBF', line=dict(color='red')))
fig.add_traces(go.Scatter(x=x_range, y=y_poly, name='Support Vector Regression - Poly', line=dict(color='blue')))
fig.add_traces(go.Scatter(x=x_range, y=y_ridge, name='Ridge Regression', line=dict(color='yellow')))
fig.add_traces(go.Scatter(x=x_range, y=y_poisson, name='Poisson Regression', line=dict(color='pink')))
fig.update_layout(dict(plot_bgcolor = 'white'))
# style the axes and gridlines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
showline=True, linewidth=1, linecolor='black')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
showline=True, linewidth=1, linecolor='black')
fig.update_layout(title=dict(text="Spending vs. Crime, Generating Prediction (epsilon=10, C=10, index)",
font=dict(color='black')))
fig.update_traces(marker=dict(size=3))
fig.show()
X = svm_df_test['%s' % s].values.reshape(-1,1)
y = svm_df_test['total_crime_index'].values
x_range = np.linspace(X.min(), X.max(), 100)
# evaluate each model over the testing x-range
y_lr = model1.predict(x_range.reshape(-1, 1))
y_svr = model2.predict(x_range.reshape(-1, 1))
y_poly = model3.predict(x_range.reshape(-1, 1))
y_ridge = model4.predict(x_range.reshape(-1, 1))
y_poisson = model5.predict(x_range.reshape(-1, 1))
fig = px.scatter(df, x=svm_df_test['%s' % s], y=svm_df_test['total_crime_index'],
opacity=0.8, color_discrete_sequence=['black'])
# add the fitted regression curves to the scatter plot
fig.add_traces(go.Scatter(x=x_range, y=y_lr, name='Linear Regression', line=dict(color='green')))
fig.add_traces(go.Scatter(x=x_range, y=y_svr, name='Support Vector Regression - RBF', line=dict(color='red')))
fig.add_traces(go.Scatter(x=x_range, y=y_poly, name='Support Vector Regression - Poly', line=dict(color='blue')))
fig.add_traces(go.Scatter(x=x_range, y=y_ridge, name='Ridge Regression', line=dict(color='yellow')))
fig.add_traces(go.Scatter(x=x_range, y=y_poisson, name='Poisson Regression', line=dict(color='pink')))
fig.update_layout(dict(plot_bgcolor = 'white'))
# style the axes and gridlines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
showline=True, linewidth=1, linecolor='black')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
showline=True, linewidth=1, linecolor='black')
fig.update_layout(title=dict(text="Spending vs. Crime, Testing Prediction (epsilon=10, C=10, index)",
font=dict(color='black')))
fig.update_traces(marker=dict(size=3))
fig.show()
# using the function get_score to print out the accuracy of each model
get_score(lr, svm_df_test['%s' % s].values.reshape(-1,1), svm_df_test['total_crime_index'].values, X, y)
get_score(svr_rbf, svm_df_test['%s' % s].values.reshape(-1,1), svm_df_test['total_crime_index'].values, X, y)
get_score(svr_poly, svm_df_test['%s' % s].values.reshape(-1,1), svm_df_test['total_crime_index'].values, X, y)
get_score(ridge, svm_df_test['%s' % s].values.reshape(-1,1), svm_df_test['total_crime_index'].values, X, y)
get_score(poisson, svm_df_test['%s' % s].values.reshape(-1,1), svm_df_test['total_crime_index'].values, X, y)
one_one_compare('index_health_care')
Accuracy Report (Linear Regression): -0.15 accuracy with a standard deviation of 0.27; R2 Score: -0.286286; Mean Square Error: 1007.732441
Accuracy Report (SVR - RBF): -0.11 accuracy with a standard deviation of 0.12; R2 Score: -0.203481; Mean Square Error: 942.859306
Accuracy Report (SVR - Poly): -0.22 accuracy with a standard deviation of 0.32; R2 Score: -0.218292; Mean Square Error: 954.462975
Accuracy Report (Ridge Regression): -0.15 accuracy with a standard deviation of 0.27; R2 Score: -0.286283; Mean Square Error: 1007.730702
Accuracy Report (Poisson Regression): -0.16 accuracy with a standard deviation of 0.26; R2 Score: -0.238024; Mean Square Error: 969.921958
For the plot of index_health_care vs. index_crime, we found that none of the regression models represent our training data well, owing to the large spread of the data points, though the RBF model fits the training data best among them. All of the models display a clear decreasing trend, indicating that as spending on health care increases, the predicted crime rate decreases. Note that this differs in sign from the OLS results, where index_health_care had a small positive coefficient; in both cases, however, the relationship is weak, and all of the models have low accuracy on the training data. Looking at the plot on the testing data, all models again display a clear decreasing trend, but since the testing set contains only 20% of the data, the points are even more spread out and the predictions even less accurate. The Accuracy Report below the plots shows that the accuracy scores for all models are negative, ranging from -0.22 to -0.11, as are the R2 scores, ranging from approximately -0.29 to -0.20; the mean squared errors are also extremely high (around 1,000). Since both the training and testing fits are inaccurate (accuracy and R2 scores below 0, mean squared error around 1,000), we conclude that there is no meaningful relationship between health care spending and the crime rate.
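A negative R2 (and a negative cross-validation "accuracy", which for scikit-learn regressors is also R2) simply means the model predicts worse than a horizontal line at the mean of the targets. A small illustration with sklearn's r2_score on toy numbers:

```python
from sklearn.metrics import r2_score

y_true = [10, 20, 30, 40]   # toy crime-index values
y_mean = [25, 25, 25, 25]   # always predicting the mean gives R2 == 0
y_bad = [0, 0, 0, 0]        # predicting worse than the mean gives R2 < 0

print(r2_score(y_true, y_mean))  # 0.0
print(r2_score(y_true, y_bad))   # -5.0
```

So the scores around -0.1 to -0.3 above say the fitted curves carry no more information than the sample mean of the crime index.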
one_one_compare('index_education')
Accuracy Report (Linear Regression): -0.13 accuracy with a standard deviation of 0.27; R2 Score: -0.250586; Mean Square Error: 979.763955
Accuracy Report (SVR - RBF): -0.21 accuracy with a standard deviation of 0.28; R2 Score: -0.191827; Mean Square Error: 933.729344
Accuracy Report (SVR - Poly): -0.17 accuracy with a standard deviation of 0.33; R2 Score: -0.116248; Mean Square Error: 874.517912
Accuracy Report (Ridge Regression): -0.13 accuracy with a standard deviation of 0.27; R2 Score: -0.250585; Mean Square Error: 979.763135
Accuracy Report (Poisson Regression): -0.12 accuracy with a standard deviation of 0.25; R2 Score: -0.247640; Mean Square Error: 977.455951
For the plot of index_education vs. index_crime, we again found that none of the regression models represent our training data well, owing to the large spread of the data points, though the RBF model fits the training data best among them. All of the models display a clear decreasing trend, indicating that as spending on education increases, the predicted crime rate decreases. Although this matches the weak negative relationship observed in the OLS results, all of the models have low accuracy on the training data. Looking at the plot on the testing data, all models again display a clear decreasing trend, but since the testing set contains only 20% of the data, the points are even more spread out and the predictions even less accurate. The Accuracy Report shows that the accuracy scores for all models are negative, ranging from -0.21 to -0.12, as are the R2 scores, ranging from approximately -0.25 to -0.12; the mean squared errors are also extremely high (around 1,000). We therefore conclude that there is no meaningful relationship between education spending and the crime rate.
one_one_compare('index_home_security_system_svcs')
Accuracy Report (Linear Regression): -0.14 accuracy with a standard deviation of 0.26; R2 Score: -0.351375; Mean Square Error: 1058.726129
Accuracy Report (SVR - RBF): -0.12 accuracy with a standard deviation of 0.20; R2 Score: -0.237577; Mean Square Error: 969.571877
Accuracy Report (SVR - Poly): -0.20 accuracy with a standard deviation of 0.32; R2 Score: -0.343571; Mean Square Error: 1052.611956
Accuracy Report (Ridge Regression): -0.14 accuracy with a standard deviation of 0.26; R2 Score: -0.351373; Mean Square Error: 1058.724394
Accuracy Report (Poisson Regression): -0.14 accuracy with a standard deviation of 0.24; R2 Score: -0.292793; Mean Square Error: 1012.830505
For the plot of index_home_security_system_svcs vs. index_crime, once again none of the regression models represent our training data well, owing to the large spread of the data points, though the RBF model fits the training data best among them. All of the models display a clear decreasing trend, indicating that as spending on home security system services increases, the predicted crime rate decreases. Although this matches the weak negative relationship observed in the OLS results, all of the models have low accuracy on the training data. On the testing data, the models again display a clear decreasing trend, but since the testing set contains only 20% of the data, the points are even more spread out and the predictions even less accurate. The Accuracy Report shows that the accuracy scores for all models are negative, ranging from -0.20 to -0.12, as are the R2 scores, ranging from approximately -0.35 to -0.24; the mean squared errors are also extremely high (around 1,000). We therefore conclude that there is no meaningful relationship between home security system services spending and the crime rate.
# use sklearn's SVR and related estimators to fit the data with Linear Regression, RBF Regression,
# Polynomial Regression, Ridge Regression, and Poisson Regression; this function
# compares two spending indices with the total crime index
def two_one_compare(s1, s2):
fig = px.scatter_3d(svm_df,
x=svm_df['%s' % s1],
y=svm_df['%s' % s2],
z=svm_df['total_crime_index'],
opacity=0.8, color_discrete_sequence=['black'],
height=900, width=900)
fig.update_layout(title_text="Scatter 3D Plot",
scene_camera_eye=dict(x=1.5, y=1.5, z=0.25),
scene_camera_center=dict(x=0, y=0, z=-0.2),
scene = dict(xaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'),
yaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'
),
zaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey')))
fig.update_traces(marker=dict(size=2))
fig.show()
X=svm_df[['%s' % s1,'%s' % s2]]
y=svm_df['total_crime_index'].values
# declare 5 models to train the data
model1 = LinearRegression()
lr = model1.fit(X, y)
model2 = SVR(kernel='rbf', C=10, epsilon=1)
svr_rbf = model2.fit(X, y)
model3 = SVR(kernel='poly', degree=2, C=10, epsilon=1)
svr_poly = model3.fit(X, y)
model4 = Ridge(alpha=1.0)
ridge = model4.fit(X, y)
model5 = PoissonRegressor(alpha=1.0)
poisson = model5.fit(X, y)
mesh_size = 0.5
x_min, x_max = X['%s' % s1].min(), X['%s' % s1].max()
y_min, y_max = X['%s' % s2].min(), X['%s' % s2].max()
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)
# predicting the training data in linear regression
pred_lr = model1.predict(np.c_[xx.ravel(), yy.ravel()])
pred_lr = pred_lr.reshape(xx.shape)
# predicting the training data in RBF regression
pred_svr_rbf = model2.predict(np.c_[xx.ravel(), yy.ravel()])
pred_svr_rbf = pred_svr_rbf.reshape(xx.shape)
# predicting the training data in polynomial regression
pred_svr_poly = model3.predict(np.c_[xx.ravel(), yy.ravel()])
pred_svr_poly = pred_svr_poly.reshape(xx.shape)
# predicting the training data in ridge regression
pred_ridge = model4.predict(np.c_[xx.ravel(), yy.ravel()])
pred_ridge = pred_ridge.reshape(xx.shape)
# predicting the training data in poisson regression
pred_poisson = model5.predict(np.c_[xx.ravel(), yy.ravel()])
pred_poisson = pred_poisson.reshape(xx.shape)
fig = px.scatter_3d(svm_df, x=svm_df['%s' % s1],
y=svm_df['%s' % s2],
z=svm_df['total_crime_index'],
opacity=0.8, color_discrete_sequence=['black'],
height=900, width=900)
fig.update_layout(title_text="Scatter 3D Plot with Regression Prediction Surfaces, Generating Prediction",
scene_camera_eye=dict(x=1.5, y=1.5, z=0.25),
scene_camera_center=dict(x=0, y=0, z=-0.2),
scene = dict(xaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'),
yaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'
),
zaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey')))
fig.update_traces(marker=dict(size=2))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_lr, name='lr',
colorscale=px.colors.sequential.Greens, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_svr_rbf, name='rbf',
colorscale=px.colors.sequential.Reds, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_svr_poly, name='poly',
colorscale=px.colors.sequential.Blues, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_ridge, name='ridge',
colorscale=px.colors.sequential.YlOrBr, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_poisson, name='poisson',
colorscale=px.colors.sequential.Purpor, showscale=False))
fig.update_traces(showlegend=True, selector=dict(type='surface'))
fig.show()
# evaluate the models on the testing data
X=svm_df_test[['%s' % s1,'%s' % s2]]
y=svm_df_test['total_crime_index'].values
x_min, x_max = X['%s' % s1].min(), X['%s' % s1].max()
y_min, y_max = X['%s' % s2].min(), X['%s' % s2].max()
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)
# predicting the testing data in linear regression
pred_lr = model1.predict(np.c_[xx.ravel(), yy.ravel()])
pred_lr = pred_lr.reshape(xx.shape)
# predicting the testing data in RBF regression
pred_svr_rbf = model2.predict(np.c_[xx.ravel(), yy.ravel()])
pred_svr_rbf = pred_svr_rbf.reshape(xx.shape)
# predicting the testing data in polynomial regression
pred_svr_poly = model3.predict(np.c_[xx.ravel(), yy.ravel()])
pred_svr_poly = pred_svr_poly.reshape(xx.shape)
# predicting the testing data in ridge regression
pred_ridge = model4.predict(np.c_[xx.ravel(), yy.ravel()])
pred_ridge = pred_ridge.reshape(xx.shape)
# predicting the testing data in poisson regression
pred_poisson = model5.predict(np.c_[xx.ravel(), yy.ravel()])
pred_poisson = pred_poisson.reshape(xx.shape)
fig = px.scatter_3d(svm_df_test, x=svm_df_test['%s' % s1],
y=svm_df_test['%s' % s2],
z=svm_df_test['total_crime_index'],
opacity=0.8, color_discrete_sequence=['black'],
height=900, width=900)
fig.update_layout(title_text="Scatter 3D Plot with Regression Prediction Surfaces, Prediction Results",
scene_camera_eye=dict(x=1.5, y=1.5, z=0.25),
scene_camera_center=dict(x=0, y=0, z=-0.2),
scene = dict(xaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'),
yaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'
),
zaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey')))
fig.update_traces(marker=dict(size=2))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_lr, name='lr',
colorscale=px.colors.sequential.Greens, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_svr_rbf, name='rbf',
colorscale=px.colors.sequential.Reds, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_svr_poly, name='poly',
colorscale=px.colors.sequential.Blues, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_ridge, name='ridge',
colorscale=px.colors.sequential.YlOrBr, showscale=False))
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_poisson, name='poisson',
colorscale=px.colors.sequential.Purpor, showscale=False))
fig.update_traces(showlegend=True, selector=dict(type='surface'))
fig.show()
# use get_score to print out the accuracy of each model
get_score(lr, svm_df_test[['%s' % s1,'%s' % s2]], svm_df_test['total_crime_index'].values, X, y)
get_score(svr_rbf, svm_df_test[['%s' % s1,'%s' % s2]], svm_df_test['total_crime_index'].values, X, y)
get_score(svr_poly, svm_df_test[['%s' % s1,'%s' % s2]], svm_df_test['total_crime_index'].values, X, y)
get_score(ridge, svm_df_test[['%s' % s1,'%s' % s2]], svm_df_test['total_crime_index'].values, X, y)
get_score(poisson, svm_df_test[['%s' % s1,'%s' % s2]], svm_df_test['total_crime_index'].values, X, y)
Similar to the previous analysis, we now take two of the three variables (healthcare, education, and home security) at a time and build regression models on each pair of independent variables. We again split the data into 80% training and 20% testing sets. For convenience, we wrote a function that shows the 3D distribution of the data, the training data with its prediction surfaces, and the testing data with its prediction results.
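The prediction surfaces in that function come from evaluating each fitted model on a rectangular grid. The np.meshgrid / ravel / reshape round-trip it relies on can be sketched in isolation (the x + y "model" here is a stand-in for model.predict):

```python
import numpy as np

xrange = np.arange(0.0, 2.0, 1.0)    # grid steps along the first predictor
yrange = np.arange(0.0, 3.0, 1.0)    # grid steps along the second predictor
xx, yy = np.meshgrid(xrange, yrange)  # xx and yy are each shape (3, 2)

# flatten the grid into (n_points, 2) rows, one row per (x, y) pair,
# which is the shape an estimator's predict expects
points = np.c_[xx.ravel(), yy.ravel()]

# stand-in prediction x + y, reshaped back into the grid shape for go.Surface
pred = (points[:, 0] + points[:, 1]).reshape(xx.shape)
print(pred.shape)  # (3, 2)
```

Reshaping back to xx.shape is what lets plotly draw the predictions as a surface over the two spending indices.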
two_one_compare('index_health_care','index_education')
Accuracy Report (Linear Regression): -0.16 accuracy with a standard deviation of 0.28; R2 Score: -0.288568; Mean Square Error: 1009.520880
Accuracy Report (SVR - RBF): -0.19 accuracy with a standard deviation of 0.20; R2 Score: -0.237425; Mean Square Error: 969.453077
Accuracy Report (SVR - Poly): -0.20 accuracy with a standard deviation of 0.36; R2 Score: -0.236317; Mean Square Error: 968.585185
Accuracy Report (Ridge Regression): -0.16 accuracy with a standard deviation of 0.28; R2 Score: -0.288564; Mean Square Error: 1009.517175
Accuracy Report (Poisson Regression): -0.16 accuracy with a standard deviation of 0.27; R2 Score: -0.240806; Mean Square Error: 972.102103
We first analyze index_health_care and index_education together. The first graph shows that the data points are scattered with no obvious pattern. In the second graph, the regression prediction surfaces pass near many of the points, but a large number of points still lie well off the surfaces. In the last graph, fewer points lie off the surfaces, but only because it displays just the 20% testing data. The Accuracy Report agrees with this impression: every model has an R2 score below 0, a huge mean squared error (around 1,000), and low accuracy. We therefore claim that health care and education spending together have no meaningful relationship with the crime index.
two_one_compare('index_education','index_home_security_system_svcs')
Accuracy Report (Linear Regression): -0.17 accuracy with a standard deviation of 0.27; R2 Score: -0.350345; Mean Square Error: 1057.919529
Accuracy Report (SVR - RBF): -0.22 accuracy with a standard deviation of 0.30; R2 Score: -0.304364; Mean Square Error: 1021.895511
Accuracy Report (SVR - Poly): -0.18 accuracy with a standard deviation of 0.35; R2 Score: -0.267446; Mean Square Error: 992.972777
Accuracy Report (Ridge Regression): -0.17 accuracy with a standard deviation of 0.27; R2 Score: -0.350342; Mean Square Error: 1057.916957
Accuracy Report (Poisson Regression): -0.16 accuracy with a standard deviation of 0.26; R2 Score: -0.294582; Mean Square Error: 1014.232281
We then analyze index_home_security_system_svcs and index_education together. The first graph shows that the data points are scattered with no obvious pattern. In the second graph, the regression prediction surfaces pass near many of the points, but a large number of points still lie well off the surfaces. In the last graph, fewer points lie off the surfaces, but only because it displays just the 20% testing data. The Accuracy Report agrees with this impression: every model has an R2 score below 0, a huge mean squared error (around 1,000), and low accuracy. We therefore claim that home security and education spending together have no meaningful relationship with the crime index.
two_one_compare('index_home_security_system_svcs','index_health_care')